Extraction of Syntactic Translation Models from Parallel Data using Syntax from Source and Target Languages

Authors

  • Vamshi Ambati
  • Alon Lavie
  • Jaime Carbonell
Abstract

We propose a generic rule induction framework that is informed by syntax from both sides of a parsed parallel corpus, expressed as sets of structural, boundary, and labeling constraints. Factoring syntax in this manner allows our framework to work with independent annotations coming from multiple resources, not necessarily a single syntactic structure. We then explore the lexical coverage of translation models learned in different scenarios, using syntax from one side vs. both sides. We specifically look at how the non-isomorphic nature of parse trees for the two languages affects coverage. We propose a novel technique for restructuring target-side parse trees that generates alternate isomorphic target trees while preserving the syntactic boundaries of constituents that were aligned in the original parse trees. We also show that combining rules extracted by restructuring syntactic trees on both sides produces significantly better translation models. The improved precision and coverage of our syntax tables particularly compensate for the lack of lexical coverage in syntax-based machine translation approaches.


Related resources

Phrase-Boundary Translation Model Using Shallow Syntactic Labels

The phrase-boundary model for statistical machine translation labels rules with classes of boundary words on the target-side phrases of the training corpus. In this paper, we extend the phrase-boundary model using shallow syntactic labels, including POS tags and chunk labels. With the priority of chunk labels, the proposed model names non-terminals with shallow syntactic labels on the boundaries of ...


Improving Syntax Driven Translation Models by Re-structuring Divergent and Non-isomorphic Parse Tree Structures

Syntax-based approaches to statistical MT require syntax-aware methods for acquiring their underlying translation models from parallel data. This acquisition process can be driven by syntactic trees for either the source or target language, or by trees on both sides. Work to date has demonstrated that using trees for both sides suffers from severe coverage problems. This is primarily due to the...


Machine Translation with Significant Word Reordering and Rich Target-Side Morphology

This paper describes the integration of morpho-syntactic information in phrase-based and syntax-based Machine Translation systems. We mainly focus on translating in the hard direction which is translating from morphologically poor to morphologically richer languages and also between language pairs that have significant word order differences. We intend to use hierarchical or surface syntactic m...


Cross Lingual Syntax Projection for Resource-Poor Languages

Over the past few decades, supervised learning in structured spaces has been quite successful in syntactic analysis problems in natural language processing. These learning techniques exploit large amounts of annotated data to learn models that can perform linguistic analysis on unseen data. Acquiring such supervised linguistic annotations for a language is important for natural language process...


Syntactic Reordering as Pre-processing Step in Statistical Machine Translation of English to Sesotho sa Leboa and Afrikaans

The output quality of statistical machine translation (SMT) depends to a large extent on the quantity and quality of the parallel corpora on which it is trained. In the case of resource-scarce languages, where sufficiently large parallel corpora are not always available, alternative ways of improving the output quality of SMT systems must be sought. In this article, one such method for improvi...




Publication date: 2009